wide and deep neural network
Critical Initialization of Wide and Deep Neural Networks using Partial Jacobians: General Theory and Applications
Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality.
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that, the expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a \textit{neural tangent random feature} (NTRF) model. For data distributions that can be classified by NTRF model with sufficiently small error, our result yields a generalization error bound in the order of $\tilde{\mathcal{O}}(n^{-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.
Reviews: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
Originality: To the best of my knowledge, the results are novel and provide important extensions/improvements over the previous art. Quality: I did a high level check of the proofs and it seems sound to me. Clarity: the paper is a joy to read. The problem definition, assumptions, the algorithm, and statement of results are very well presented. Significance: the results provide several extensions and improvements over the previous work, including training deeper models, training all layers, training with SGD (rather than GD), and smaller required overparameterization.
Reviews: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
This paper provides a generalization bound for training over-parameterized deep neural networks with ReLU activation and cross-entropy loss using SGD. Initially the paper received mixed reviews, with two positive and one negative reviews. On the one hand, the analysis is found to be intuitive, general, and potentially influential, the generalization bound is found to be more general and sharper than many existing generalization error bounds for over-parameterized neural networks, and the paper to be very well written. On the other, hand the width requirement is found to be too strict. The rebuttal addressed the issues raised by the reviewers, one rating was increased from 6 to 8, and the negative review updated the score to 6. Upon discussion, the reviewers agreed that the paper should be accepted.
Critical Initialization of Wide and Deep Neural Networks using Partial Jacobians: General Theory and Applications
Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality.
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that, the expected 0 - 1 loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a \textit{neural tangent random feature} (NTRF) model. For data distributions that can be classified by NTRF model with sufficiently small error, our result yields a generalization error bound in the order of \tilde{\mathcal{O}}(n {-1/2}) that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.
Critical initialization of wide and deep neural networks through partial Jacobians: general theory and applications to LayerNorm
Doshi, Darshil, He, Tianyu, Gromov, Andrey
Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new way to diagnose (both theoretically and empirically) this criticality. To that end, we introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0
- North America > United States > Rhode Island (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that, the expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a \textit{neural tangent random feature} (NTRF) model. For data distributions that can be classified by NTRF model with sufficiently small error, our result yields a generalization error bound in the order of $\tilde{\mathcal{O}}(n {-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.
A Wide and Deep Neural Network for Survival Analysis from Anatomical Shape and Tabular Clinical Data
Pölsterl, Sebastian, Sarasua, Ignacio, Gutiérrez-Becker, Benjamín, Wachinger, Christian
We introduce a wide and deep neural network for prediction of progression from patients with mild cognitive impairment to Alzheimer's disease. Information from anatomical shape and tabular clinical data (demographics, biomarkers) are fused in a single neural network. The network is invariant to shape transformations and avoids the need to identify point correspondences between shapes. To account for right censored time-to-event data, i.e., when it is only known that a patient did not develop Alzheimer's disease up to a particular time point, we employ a loss commonly used in survival analysis. Our network is trained end-to-end to combine information from a patient's hippocampus shape and clinical biomarkers. Our experiments on data from the Alzheimer's Disease Neuroimaging Initiative demonstrate that our proposed model is able to learn a shape descriptor that augments clinical biomarkers and outperforms a deep neural network on shape alone and a linear model on common clinical biomarkers.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)